Copy tokenizer files in each of their repo #10624
Conversation
Love it! Maybe a good practice to link to a sample of the related commits on hf.co: for instance, https://huggingface.co/facebook/bart-base/commit/c2469fb7e666a5c5629a161f17c9ef23c85217f7
I think I did around 50 of them in various repos to move all the tokenizer files, so it's a bit hard to keep track of all of them.
Yep, just link one, or a small sample. It makes it easier to see what this PR entails on the hf-hub side.
* Move tokenizer files in each repo
* Fix mBART50 tests
* Fix mBART tests
* Fix Marian tests
* Update templates
What does this PR do?
This PR cleans up the maps in the tokenizer files to make sure each checkpoint has its own tokenization files. This will allow us to remove custom code that mapped some checkpoints to special files (like BART using the RoBERTa vocab files) and to take full advantage of the versioning system for those checkpoints. The tokenizer files for all affected checkpoints have been copied to the corresponding model repos in parallel.
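To make this concrete, here is a minimal sketch of the old pattern (illustrative only: the exact checkpoint names and URLs are assumptions, not the contents of this diff). BART checkpoints had no tokenizer files of their own, so the map redirected them to RoBERTa's files:

```python
# Simplified sketch of the pre-PR situation (hypothetical entries):
# BART checkpoints reused RoBERTa's vocab/merges files, so the map
# had to redirect each checkpoint to another repo on the hub.
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "facebook/bart-base": "https://huggingface.co/roberta-large/resolve/main/vocab.json",
    },
    "merges_file": {
        "facebook/bart-base": "https://huggingface.co/roberta-large/resolve/main/merges.txt",
    },
}
```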
For instance, to accommodate the move on the fast BART tokenizers, the following commits have been made on the model hub:
In this PR I've also made the structure of the maps uniform across models, to make it easier to alter (and ultimately remove) them in the future via automated scripts.
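For reference, a sketch of the uniform shape after the move (again assumed entries, not a copy of the diff): every checkpoint points at tokenizer files hosted in its own repo, so the redirection logic disappears and the maps become easy to process, and eventually drop, with a script.

```python
# Sketch of the post-PR structure (hypothetical entries): each
# checkpoint resolves its tokenizer files from its own repo, so no
# cross-repo redirection or custom code is needed.
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "facebook/bart-base": "https://huggingface.co/facebook/bart-base/resolve/main/vocab.json",
    },
    "merges_file": {
        "facebook/bart-base": "https://huggingface.co/facebook/bart-base/resolve/main/merges.txt",
    },
}
```

With the files in each repo, the standard loading path works without any special-casing:

```python
from transformers import AutoTokenizer

# Fetches the tokenizer files directly from the checkpoint's own repo.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
```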